Introduction

The purpose of this exercise is to perform exploratory data analysis (EDA) on Tweets related to the Shellshock vulnerability.

The data has been downloaded using Twitter's REST API and the Tweepy Python module, and placed in a simple text file.

Each line in the input file is a JSON-encoded Tweet related to Shellshock.

See Download Tweets for Shellshock.ipynb for the download part.

Most of the data munging is done in Python, but plotting is done using RMagic and ggplot2.

Why? My R/dplyr skills are not as good as my Python/pandas skills, and when it comes to charting nothing beats ggplot2.


Data Preparation

Load modules

In [ ]:
%matplotlib inline
%load_ext rpy2.ipython

In [ ]:
%%R
library(ggplot2)
library(gridExtra)
library(scales)
library(base)
print('Loaded R libraries')

In [3]:
import pandas as pd
import sys
import json
from datetime import datetime
import gzip
import time
from ggplot import *
import requests as r
from httpcache import CachingHTTPAdapter
from IPython.display import HTML, display_json, \
    display_html, display_javascript, JSON, Javascript
import re
import numpy as np
from urlparse import urlparse

# Some Pandas options TODO revisit this later
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',2000)

# Cache http requests TODO confirm that this works
s = r.Session()
s.mount('http://', CachingHTTPAdapter())
s.mount('https://', CachingHTTPAdapter())
Load data from file

In [4]:
fName = "data/shellshockTweets.txt.gz"

tweets = []
st = time.time()
with gzip.GzipFile(fName, 'r') as fp:
    for tweet in fp:
        t = json.loads(tweet)
        # Parse 'created_at' into a proper datetime.
        t['created_at'] = \
            datetime.strptime(t['created_at'],
                              '%a %b %d %H:%M:%S +0000 %Y')
        tweets.append(t)
# The 'with' block closes the file automatically; no explicit close needed.
et = time.time()
print ('Loaded {0} tweets in {1:.2f} secs'.format(len(tweets),(et-st)))


Loaded 336108 tweets in 106.00 secs
Parse Tweet JSONs

In [5]:
# TODO pep8ify this code block

st = time.time()
tweetsDF = pd.DataFrame(tweets)
tweetsDF['user_loc'] = tweetsDF['user'].map(lambda u: u['location'])
tweetsDF['tweeted_to'] = tweetsDF['user'].map(lambda u: u['followers_count'])
tweetsDF['screen_name'] = tweetsDF['user'].map(lambda u: u['screen_name'])
# Is it a retweet?
tweetsDF['status'] = pd.isnull(tweetsDF['retweeted_status']).map(
    lambda s: 'original' if s else 'retweet')

# Extract the link text (the client name) from the source anchor tag.
regexp_link = r'''<a .+>(.+)</a>'''
pattern = re.compile(regexp_link)
tweetsDF['source'] = tweetsDF['source'].map(lambda x: pattern.findall(x)[0]) 

# Hashtags and URLs are buried deep in the JSON;
# extract them and convert them to a space-separated string.
tweetsDF['hashtags'] = tweetsDF['entities'].map(
    lambda x: ' '.join(([ht['text'].lower().strip()
                         for ht in x['hashtags']])))
tweetsDF['urls'] = tweetsDF['entities'].map(
    lambda x: ' '.join(([ht['expanded_url'].strip()
                        for ht in x['urls']])))
tweetsDF['urlDomains'] = tweetsDF['entities'].map(
    lambda x: ' '.join((['{uri.scheme}://{uri.netloc}/'.format(uri=urlparse(
                        ht['expanded_url'].strip()))
                        for ht in x['urls']])))
et = time.time()
# A dedicated DF for unique Tweets (copied to avoid chained-assignment warnings).
uniqtweetsDF = tweetsDF[tweetsDF['status'] == 'original'].copy()
uniqtweetsDF.reset_index(inplace=True)
# for converting to R DataFrame
cols=['created_at','lang','user_loc','screen_name','retweet_count',
      'tweeted_to','status','source','hashtags','urls','urlDomains']

df = tweetsDF[cols]
udf = uniqtweetsDF[cols]
print ('Parsed tweets in {0:.2f} secs and found {1} unique tweets'.format((et-st),len(uniqtweetsDF)))


Parsed tweets in 24.15 secs and found 204985 unique tweets

In [6]:
%%R -i df -i udf

df$urls <- as.character(df$urls)
df$urlDomains <- as.character(df$urlDomains)
df$hashtags <- as.character(df$hashtags)

udf$urls <- as.character(udf$urls)
udf$urlDomains <- as.character(udf$urlDomains)
udf$hashtags <- as.character(udf$hashtags)

print('Loaded DataFrames in R')


[1] "Loaded DataFrames in R"
See a few sample Tweets if needed.

In [ ]:
# Sample a few Tweets
from IPython.display import display
sample=tweetsDF.loc[[27,446,9898]]
HTML(sample.T.to_html())

Plots

Trend Over Time

Let's start with some histograms and density plots to see the trends over time.

Stacked Distribution of Original Tweets & Retweets by day

This plot shows how the tweets (both originals and retweets) trend from day to day.


In [27]:
%%R  -w 800 -h 800 -u px
plt <- ggplot(df, aes(x=created_at, fill=status, color=status)) +
        scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d')) +
        scale_y_continuous(labels=comma) +
        geom_histogram(alpha=0.5,binwidth=24*60*60,drop=TRUE) +
        theme(axis.text.x=element_text(angle=90)) + 
        xlab('Date') + ylab('Tweets') + 
        ggtitle('All Tweets related to Shellshock per day\nStacked by Status\n') 

plt2 <- plt + 
        ggtitle('All Tweets related to Shellshock per day\nUnstacked\n')  + 
        facet_grid(. ~ status)

suppressMessages(g <- grid.arrange(plt,plt2,nrow=2))



Some Notes & Observations

Notes

  • The Twitter API doesn't give us all the tweets, only a fraction.
  • So what we see above is the distribution of that fraction, and the API makes no promise that this fraction is representative of the population.
  • But comparing the above graph against the Google Trends graph for 'shellshock' shown below reveals a remarkable similarity.

Observations

  • The bulk of the activity occurred on 2 days: 9/25 (Thu) & 9/26 (Fri)
  • People tend to tweet significantly less on Sundays (9/28, 10/5, 10/12) as compared to other days (well, the infosec crowd anyway)

Density Plots of Original Tweets and Retweets.

Notice how easy it is to compare the distributions using a density plot v/s the two histogram plots above.


In [28]:
%%R  -w 800 -h 800 -u px
plt <- ggplot(df, aes(x=created_at,color=status)) +
        scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d'))  +
       theme(axis.text.x=element_text(angle=90)) + 
       xlab('Time') + ylab('Density') + 
       ggtitle('All Tweets related to Shellshock per day\n')

plt1 <- plt +
        geom_density(alpha=0.2,aes(fill=status),drop=TRUE)

plt2 <- plt +
        geom_histogram(alpha=0.2,aes(y = ..density..,fill=status),drop=TRUE) + 
        geom_density(drop=TRUE) + 
        facet_grid(. ~ status)

suppressMessages(grid.arrange(plt1,plt2,nrow=2))


Tweets Distribution by Day for Top 6 Languages

Finally, we look at how the Tweets trend over time for the Top 6 Languages.


In [15]:
top_N_lang=df[df['lang'].isin(
    pd.value_counts(df['lang'])[:6].index)]

In [16]:
%%R -i top_N_lang -w 1024 -h 480 -u px
plt <- ggplot(top_N_lang, aes(x=created_at, fill=lang, color=lang)) +
        scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d')) +
        scale_y_continuous(labels=comma) +
        geom_histogram(alpha=0.5,binwidth=24*60*60,drop=TRUE) +
        theme(axis.text.x=element_text(angle=90,hjust=1)) + 
        xlab('Date') + ylab('Tweets') + 
        ggtitle('All Tweets related to Shellshock per day\n') + 
        facet_wrap(~ lang,scales='free_y')
plt


  • Most of the tweeting happened in English, an order of magnitude more than in Spanish, which came in second, followed by Japanese, German & French.

List of Top 20 Retweeted tweets

Before diving into charts, let's look at what the top 20 retweeted tweets were.

Looking at the data below, none of the tweets (bar the first one) was retweeted more than a few hundred times.

Infosec tweets don't attract the same kind of attention as what the flavor of the month celebrity had for breakfast.
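One quick, non-graphical way to confirm that long tail (a sketch, reusing the uniqtweetsDF built above):

In [ ]:
# Summary stats of retweet counts across original tweets;
# a heavy tail shows up as max >> mean.
print(uniqtweetsDF['retweet_count'].describe())
# How many originals crossed a few hundred retweets?
print((uniqtweetsDF['retweet_count'] > 500).sum())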



In [17]:
# Top 20 retweeted tweets
top20Retweeted = uniqtweetsDF.sort(columns='retweet_count',ascending=False)[:20].reset_index()
# Display Top 20 Retweeted tweets using Tweeter's oembed API.
HTML('<div><h3>Top 20 Retweeted Tweets</h3><br/>' + \
     "".join(['<div> Retweeted '+"{:,}".format(row['retweet_count'])+' times:' + \
     r.get('https://api.twitter.com/1/statuses/oembed.json?id='+ \
           row['id_str']+'&align=center&omit_script=true').json()['html']+ \
     '</div>'
     for row in top20Retweeted[['id_str','retweet_count']].T.to_dict().values()]) + \
     '</div>')


Out[17]:

Top 20 Retweeted Tweets

(The embedded tweets render only in a live notebook; retweet counts ranged from 11,632 for the top tweet down to 274 for the 20th.)

Top N Bar Charts

Next we look at some Top N bar charts.

Top 20 retweeted tweets and their owners.

This is the graphical representation of the top 20 retweeted tweets data above.


In [18]:
%%R  -i top20Retweeted -w 800 -h 600 -u px
labs <- paste(rev(reorder(top20Retweeted$screen_name,top20Retweeted$retweet_count)) , 
               rev(reorder(top20Retweeted$tweeted_to,top20Retweeted$retweet_count)), sep=':')
plt <- ggplot(top20Retweeted, aes(x=reorder(id_str,retweet_count), y=retweet_count,fill=-tweeted_to)) + coord_flip() +
        scale_y_continuous(labels=comma) +
        scale_x_discrete(labels=labs) +
        geom_bar(alpha=0.8, stat='identity') +
        xlab('User:Followers') + ylab('Times Retweeted') + 
        ggtitle('Top 20 retweeted tweets\n') 
plt



In [7]:
# Helper function to plot hashtags, urls.
def getAttrFreq(df, listKey, attrKey):
    all_attrs = pd.Series([attr for attrs in df[listKey]
                            for attr in attrs.split(" ")])
    all_attrs = all_attrs[all_attrs.map(len) > 0] #Discard blank ones
    all_attrs.reset_index(drop=True, inplace=True)

    freqDF = pd.value_counts(all_attrs).reset_index()
    freqDF.columns = [attrKey, 'count']
    return freqDF     

def getKeyFreq(df, key):
    freqDF = pd.value_counts(df[key]).reset_index()
    freqDF.columns = [key, 'count']
    return freqDF

In [8]:
%%R -w 800 -h 600 -u px 
plotTopNAttrs <- function(df,label,entity) {
    labs <- rev(reorder(df[,entity],df$count))

    plt <- ggplot(df, aes_string(x=paste('reorder(',entity,',count)',sep=''), y='count')) + coord_flip() +
           scale_y_continuous(labels=comma) +
           scale_x_discrete(labels=labs) +
           geom_bar(alpha=0.8, stat='identity') +
           xlab(entity) + ylab('Count') + 
           ggtitle(paste('Top ',length(df[,entity]),' ',label,'\n',sep='')) 
    return(plt)
}
Top 20 hashtags in All Tweets

No surprises here, #shellshock was the leading hashtag by a long margin.

What's surprising is that 'cve' doesn't show up in the top 20, which is a pity; a quick check of where CVE-style tags actually rank follows the chart below.


In [27]:
# Plot Top 20 hashtags in Tweets (including retweets)
top_20_hashtags= getAttrFreq(tweetsDF, 'hashtags', 'hashtag')[:20]

In [28]:
%%R -i top_20_hashtags -w 800 -h 600 -u px 
plotTopNAttrs(top_20_hashtags,'hashtags','hashtag')
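Since 'cve' is absent from the chart above, here's that quick check of where CVE-style hashtags actually rank (a sketch, reusing the getAttrFreq helper):

In [ ]:
# Look for any hashtag containing 'cve' in the full frequency table.
hashtagFreq = getAttrFreq(tweetsDF, 'hashtags', 'hashtag')
print(hashtagFreq[hashtagFreq['hashtag'].str.contains('cve')][:10])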


Top 20 hashtags in Original Tweets

And even in just the original tweets, the leading hashtag was #shellshock.

And again CVE doesn't show up.


In [29]:
# Plot Top 20 hashtags in unique Tweets.
top_20_hashtags= getAttrFreq(uniqtweetsDF, 'hashtags', 'hashtag')[:20]

In [30]:
%%R -i top_20_hashtags -w 800 -h 600 -u px 
plotTopNAttrs(top_20_hashtags,'hashtags','hashtag')


Top 20 Sources of Tweets

Next we look at the Top 20 Sources for the tweets. Source denotes the Twitter client used for that tweet.


In [39]:
# Tweets and Retweets by Source
top_20_sources = getKeyFreq(tweetsDF, 'source')[:20]

In [42]:
%%R -i top_20_sources -w 800 -h 600 -u px 
plotTopNAttrs(top_20_sources,'sources','source')


Looking at the chart above

  • Mobile apps are quite popular but still haven't caught up with the web client (webpage).
  • iPhone and Android are almost equally popular, with iPhone slightly ahead.
Top 20 User Locations of Tweeters

Next we look at the top 20 User Locations.

Note A user location is the location a user puts in their profile; it is not the same as a tweet's geo location. So if someone puts London, UK as their location but actually tweets from New York, NY, it still counts as a UK tweet for this graph.
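To make the distinction concrete, here is where the two fields live in the tweet JSON (an illustrative sketch; t is one parsed tweet dict from the loading cell):

In [ ]:
# Illustrative: profile location v/s tweet geolocation.
t = tweets[0]
print(t['user']['location'])  # free-form profile text, e.g. 'London, UK'
print(t.get('geo'))           # actual coordinates; None for most tweets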


In [49]:
# Tweets and Retweets by User Locations
# Note these are locations set in a User Profile, not Geo Locations associated with a Tweet.
top_20_userLocs = getKeyFreq(tweetsDF, 'user_loc')[1:21]

# TODO Figure out if we can combine buckets like 'NY', 'New York', 'New York, NY' into a single bucket.

In [50]:
%%R -i top_20_userLocs -w 800 -h 600 -u px 
plotTopNAttrs(top_20_userLocs,'User Locations','user_loc')


Frankly, this was a bit of a surprise: the UK, and especially London, leads the pack.

There is some data cleanup required here; entries like 'London' and 'London, UK', or 'New York' and 'New York, NY', should be combined. But regardless, London, UK is still ahead of the other locations by a fair margin.

I had no idea London was a hotbed for the infosec crowd.
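A minimal sketch of what that cleanup could look like, using a hand-written alias map (the aliases here are illustrative examples, not derived from the data):

In [ ]:
# Bucket near-duplicate locations via a hand-made alias map (illustrative only).
LOC_ALIASES = {
    'london': 'London, UK',
    'london, uk': 'London, UK',
    'ny': 'New York, NY',
    'new york': 'New York, NY',
    'new york, ny': 'New York, NY',
}

def normalize_loc(loc):
    if not loc:  # user_loc can be empty
        return ''
    key = loc.strip().lower()
    return LOC_ALIASES.get(key, loc.strip())

print(pd.value_counts(tweetsDF['user_loc'].map(normalize_loc))[1:11])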


Top 20 URLs in All Tweets

It would be interesting to see what URLs people use in their tweets.


In [22]:
# Plot Top 20 URLS in All Tweets.
top_20_urls= getAttrFreq(tweetsDF, 'urls', 'url')[:20]
ggplot(top_20_urls, aes(x='factor(url)', y='count')) + \
        geom_histogram(stat='identity',alpha=0.8) + scale_y_continuous(labels='comma') + \
        theme(axis_text_x=element_text(angle=45, hjust=1)) + xlab('URL') + \
        ylab('Count') + ggtitle('Top 20 URLs in Tweets related to Shellshock\n')


Out[22]:
<ggplot: (703435705)>

I would not have expected cnet.com to be a leading cited source for infosec/vulnerability data. But there is a reason for this: that cnet URL is the one mentioned in the most retweeted tweet (the one that got retweeted ~11K times), hence it gets to be the leading URL across all the tweets.
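This is easy to check against the data (a sketch, reusing the uniqtweetsDF built earlier):

In [ ]:
# The URLs cited by the single most retweeted tweet.
top_tweet = uniqtweetsDF.sort(columns='retweet_count', ascending=False).iloc[0]
print('{0} retweets: {1}'.format(top_tweet['retweet_count'], top_tweet['urls']))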

Next we look at URLs in only original tweets for a comparison.


In [23]:
# Plot Top 20 URLS in Original Tweets.
top_20_urls= getAttrFreq(uniqtweetsDF, 'urls', 'url')[:20]
ggplot(top_20_urls, aes(x='factor(url)', y='count')) + \
        geom_histogram(stat='identity',alpha=0.8) + scale_y_continuous(labels='comma') + \
        theme(axis_text_x=element_text(angle=45, hjust=1)) + xlab('URL') + \
        ylab('Count') + ggtitle('Top 20 URLs in Tweets related to Shellshock\n')


Out[23]:
<ggplot: (701955425)>

As you can now see, the cnet URL is nowhere to be found if you take into account only the original tweets. Here the more traditional infosec-focused websites take the lead, with the notable exceptions of BBC and Wired.

Top 20 URL Domains in All Tweets

Next we just extract the domains from these URLs and see if there's anything interesting there.


In [31]:
# Plot Top 20 URL Domains in All Tweets.
top_20_domains= getAttrFreq(tweetsDF, 'urlDomains', 'domain')[:20]

In [32]:
%%R -i top_20_domains -w 800 -h 600 -u px 
plotTopNAttrs(top_20_domains,'domains','domain')


Nothing of much interest here, just a bunch of URL shortening services. Next we look at URL Domains in only original tweets for a comparison.

Top 20 URL Domains in Original Tweets

In [33]:
# Plot Top 20 URL Domains in Original Tweets.
top_20_domains= getAttrFreq(uniqtweetsDF, 'urlDomains', 'domain')[:20]

In [34]:
%%R -i top_20_domains -w 800 -h 600 -u px 
plotTopNAttrs(top_20_domains,'domains','domain')


Top 20 Tweeters

Let's look at the top 20 users in terms of the number of tweets related to Shellshock.


In [53]:
top_20_users =  getKeyFreq(tweetsDF, 'screen_name')[:20]

In [55]:
%%R -i top_20_users -w 800 -h 600 -u px 
plotTopNAttrs(top_20_users,'users','screen_name')


This needs more drilling into; apparently a lot of accounts were tweeting about 'shellshock' repeatedly. We need to find out whether these were unique tweets or the same content tweeted over and over again; see the sketch below.
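A first cut at that drill-down could compare each prolific account's total tweets against its count of distinct texts (a sketch; the text column comes straight from the tweet JSON):

In [ ]:
# A big gap between 'tweets' and 'distinct_texts' means the same
# content was tweeted over and over.
grp = tweetsDF.groupby('screen_name')['text']
by_user = pd.DataFrame({'tweets': grp.count(),
                        'distinct_texts': grp.apply(lambda s: s.nunique())})
print(by_user.sort(columns='tweets', ascending=False)[:20])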


Text Analysis of Tweets

Next let's do some basic text analysis of the tweet texts.


In [56]:
import nltk
from nltk.corpus import cmudict
 
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS

# for now only unique tweets in English language
uniqtexts = uniqtweetsDF.query('lang == "en"')['text']

t = nltk.tokenize.WhitespaceTokenizer()
p=re.compile(r'[^-#A-Za-z0-9]') # get rid of unwanted characters
tokenLists = uniqtexts.map(lambda text: [p.sub('', word.strip().lower()) 
                                         for word in t.tokenize(text) if 
                                         (not word.lower() in ENGLISH_STOPWORDS) and
                                         ( len(word.strip()) > 1)])
all_tokens = pd.Series([token for tokenList in tokenLists for token in tokenList if 
                        (token is not None) and 
                        (token != '')])
Top 20 words in unique tweets (in English)

In [58]:
# Plot the top 20 tokens.
top_N_words = pd.value_counts(all_tokens)[:20].reset_index()
top_N_words.columns=['token','count']

In [59]:
%%R -i top_N_words -w 800 -h 600 -u px 
plotTopNAttrs(top_N_words,'Words','token')


  • Just like the Top 20 Hashtags, all the usual suspects are here, and once again 'CVE' is absent.
  • Also note that tweeters prefer plain words over hashtags (shellshock ~85K v/s #shellshock ~24K); see the sketch below.
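The comparison can be made directly on the token series built above (a sketch; all_tokens covers only unique English tweets, so the counts won't exactly match the all-tweets figures):

In [ ]:
# The tokenizer keeps '#', so hashtags survive as distinct tokens.
print('shellshock : {0}'.format((all_tokens == 'shellshock').sum()))
print('#shellshock: {0}'.format((all_tokens == '#shellshock').sum()))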

Some other Charts

Enough of bar charts; let's look at some other interesting ways to visualize our data.

Density Plot of Retweet Counts of Original Tweets

Most tweets don't tend to be retweeted very often (barring that one exception of the tweet that got retweeted ~11K times).


In [32]:
ggplot(uniqtweetsDF.query('(retweet_count >0)').reset_index(),
       aes(x='retweet_count', color='factor(status)', fill='factor(status)')) + \
       geom_density(alpha=0.2) + scale_x_log10() + \
       theme(axis_text_x=element_text(angle=45, hjust=1)) + \
       xlim(low=1) + xlab('Times Retweeted') + ylab('Density') + \
       ggtitle('Density Plot of Retweet Count of Each Original Tweet')


Out[32]:
<ggplot: (704541713)>
Density Plot of Potential viewers for each Tweet

This plot shows the distribution of potential viewers for each tweet.

NOTE This is based only on the immediate number of followers of a user, and does not include followers of followers etc.

As can be seen, most tweets could reach fewer than ~10K users, again confirming that the infosec crowd is not as popular as celebrities.


In [31]:
ggplot(tweetsDF.query('(tweeted_to > 10)').reset_index(),
       aes(x='tweeted_to', color='factor(status)', fill='factor(status)')) + \
    geom_density(alpha=0.2) + scale_x_log10() + \
    theme(axis_text_x=element_text(angle=45, hjust=1)) + \
    xlab('Potential Viewers') + ylab('Density (Blue is retweets)') + \
    ggtitle('Density Plot of Potential Viewers of Each Tweet')


Out[31]:
<ggplot: (702880029)>
Retweet Count v/s Number of Followers

So here my assumption was that there should be some positive correlation between the number of followers you have and the number of retweets you get.

Surprisingly, what I found was a bit of negative correlation, i.e. the lower the number of your followers, the higher the number of retweets. The only way this can happen is if you have few followers but they themselves have a high number of followers; in fact, the density plot of potential viewers (the one above this one) sort of confirms this.

It was still interesting to see this graphed out.
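We can also put a number on this (a sketch; Spearman rank correlation is used because retweet counts are so heavily skewed, with the same filter as the plot below):

In [ ]:
# Rank correlation between follower count and retweet count.
sub = uniqtweetsDF.query('(retweet_count>0) & (tweeted_to<500000)')
print(sub['tweeted_to'].corr(sub['retweet_count'], method='spearman'))
# A value at or below zero backs the observation above.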


In [76]:
ggplot(uniqtweetsDF.query('(retweet_count>0) & (tweeted_to<500000)').reset_index(),
       aes(x='tweeted_to', y='retweet_count') )+ \
    geom_point() + scale_y_continuous(labels='comma') + \
    theme(axis_text_x=element_text(angle=45, hjust=1)) + \
    xlab('Potential Viewers') + ylab('Number of Retweets') + \
    scale_y_log10()  + ylim(1,2000) + xlim(0,500000) +\
    ggtitle('Potential Viewers v/s Retweeted tweets related to Shellshock')


Out[76]:
<ggplot: (702786977)>

Map Tweets

This is not as interesting as it sounds, as only a handful of tweets have geolocation data attached to them. That's not surprising, since most of the contributors here are infosec people; they know better.


In [ ]:
tweetsWithLoc=tweetsDF[pd.notnull(tweetsDF['geo'])].reset_index()
# Get rid of geo points with coords [0.0, 0.0]
tweetsWithLoc = tweetsWithLoc[tweetsWithLoc['geo'].map(
                    lambda g: (g['coordinates'][0] != 0) or (g['coordinates'][1] != 0))].reset_index()
print '\n{0} out of {1} tweets i.e ({2:.2f})% have location data\n'.format(
        len(tweetsWithLoc), len(tweetsDF), (100.0*len(tweetsWithLoc)/len(tweetsDF)))

In [ ]:
js_loader = """
function verifyJSLoaded(){
    /*var jsapiLoaded = (typeof google === 'object' && typeof google.maps === 'object');
    console.log("Google API Loaded: " + jsapiLoaded);
    return jsapiLoaded;*/
    return false;
}

function loadScript() {
  if (!verifyJSLoaded()) {
    console.log('Loading Google API.');
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src = "https://maps.googleapis.com/maps/api/js?sensor=false&libraries=visualization&callback=console.log";
    document.body.appendChild(script);
  }
}

loadScript();
"""

Javascript(js_loader)

In [ ]:
html_template = '<div id="{0}" style="width: 1024px; height: 768px"></div>'
#This is to make sure the JS gets loaded before we try to load the maps
time.sleep(1)

In [ ]:
def gen_javascript(gJSONs, div_id):
    """
    Generates javascript to draw a heatmap with Google Maps API.

    """

    # Creates Javascript objects which will comprise geoData.
    coords = ',\n  '.join(["new google.maps.LatLng(%s, %s)" % tuple(pair) for pair in gJSONs])    
    template_jscript = """
    var geoData = [
    %s
    ];
        
    var map, heatmap;
        
    function hmap_initialize() {
        var mapOptions = {
        zoom: 1,
        center: new google.maps.LatLng(30.5171, 0.1062),
        mapTypeId: google.maps.MapTypeId.SATELLITE
        };
        
        map = new google.maps.Map(document.getElementById('%s'),
              mapOptions);
        
        var pointArray = new google.maps.MVCArray(geoData);
        
        heatmap = new google.maps.visualization.HeatmapLayer({
        data: pointArray
        });
        
      heatmap.setMap(map);
    }
    
    hmap_initialize();
    """
    return template_jscript % (coords, div_id)

In [ ]:
HTML(html_template.format('map_001'))

In [ ]:
jscript = gen_javascript(tweetsWithLoc['geo'].map(lambda x: x['coordinates']),'map_001')
Javascript(jscript)

In [77]:
# All of the above works in a locally running notebook but not in nbviewer,
# so here's a static image of what the above code generated.
from IPython.display import Image
Image('tweetsOnMap.png')


Out[77]:

(Static image tweetsOnMap.png: the heatmap rendered by the code above.)